Integrating compile-time and runtime parallelism management through revocable thread serialization
Author
Abstract
Efficient decomposition of program and data on a scalable MIMD processor is necessary to minimize communication and synchronization costs while preserving sufficient parallelism to balance the processors' workload. Programmer management of these runtime aspects of computation often proves tedious and error-prone, and produces non-scalable, non-portable code. Previous efforts to apply compiler technology have assumed static timing of runtime operations, but real MIMD processors operate with much asynchrony that is difficult for the compiler to anticipate, and real programming environments support features that prevent the compiler from having total knowledge of and control over runtime conditions. Most work recognizing the asynchronous behavior of MIMD machines produces many fine-grained tasks and relies on special hardware support for fast, cheap task creation, synchronization, and context switching. Even so, these runtime mechanisms remain expensive and cannot adequately improve the performance of massively data-parallel computations, whose dominant communication and synchronization overhead occurs at data array references. Efficient decomposition of these programs requires information available from compiler analysis of the source program.

We explore a framework for integrating compile-time and runtime parallelism management. Its goal is to have the compiler produce independently scheduled, asynchronous tasks that cooperate flexibly and efficiently with a runtime manager to distribute work and allocate system resources. The compiler analyzes the source program to partition parallel loops and data structures, then aligns thread and data partitions from various loops and other parts of the program into groups called detachments, maximizing data reference locality within each detachment. The threads in a detachment are then provisionally serialized to form a single-threaded brigade. In the nominal case, each brigade is assigned to a processor and executed by a task to completion in the statically scheduled order; its coarse granularity thus saves much communication, synchronization, and task-management overhead. Should runtime load balancing be required, program threads can be split off from a brigade at points prioritized by the compiler and executed by tasks on idle processors. If the pre-scheduled thread ordering becomes incompatible with actual runtime dependencies, deadlock is averted by initiating work on a thread split from the blocked task before returning to it.

When the compiler chooses approximately the correct granularity and thread ordering, these exceptional cases are relatively rare, and we expect an overall performance gain. We implement this framework, examine potential policies controlling its compiler choices, and compare its performance with alternative parallelism management strategies.

Thesis Supervisor: Anant Agarwal
Title: Associate Professor, EECS
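As an illustration of the brigade mechanism described above, here is a minimal sketch under a shared-memory threading model. The names (Brigade, run_next, split_off) and the structure are hypothetical assumptions for the example, not the thesis implementation:

```cpp
// A minimal sketch of revocable thread serialization. Brigade, run_next,
// and split_off are hypothetical names; this illustrates the idea only.
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

struct Brigade {
    std::deque<std::function<void()>> threads; // compiler-serialized order
    std::mutex m;

    // The owning task pops from the front, preserving the static schedule.
    bool run_next() {
        std::function<void()> t;
        {
            std::lock_guard<std::mutex> g(m);
            if (threads.empty()) return false;
            t = std::move(threads.front());
            threads.pop_front();
        }
        t();
        return true;
    }

    // An idle processor revokes the serialization by splitting a program
    // thread off the back (standing in for a compiler-prioritized split point).
    bool split_off(std::function<void()> &out) {
        std::lock_guard<std::mutex> g(m);
        if (threads.empty()) return false;
        out = std::move(threads.back());
        threads.pop_back();
        return true;
    }
};

int main() {
    Brigade b;
    for (int i = 0; i < 8; ++i)
        b.threads.push_back([i] { std::printf("program thread %d\n", i); });

    // Nominal case: one task executes the whole brigade to completion.
    std::thread owner([&] { while (b.run_next()) {} });

    // Runtime load balancing: an idle worker executes split-off threads.
    std::thread idle([&] {
        std::function<void()> t;
        while (b.split_off(t)) t();
    });

    owner.join();
    idle.join();
}
```

The point of the provisional serialization is that the common path, the owner draining its own brigade in order, pays only a lock acquisition per thread, while split_off exists as an escape hatch for load imbalance; in the thesis's scheme the split points and their priorities come from compiler analysis rather than this simple back-of-deque rule.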
Similar Resources
Support for Thread-Level Speculation into OpenMP
– In-depth knowledge of the problem.
– Understanding of the underlying architecture.
– Knowledge of the parallel programming model.
• OpenMP allows code to be parallelized while "avoiding" these requirements.
• Compilers' automatic parallelization proceeds only when there is no risk.
• Thread-Level Speculation (TLS) can extract parallelism when a compile-time dependence analysis cannot guarantee that the...
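As a toy illustration of the TLS idea in the snippet above, the following single-threaded simulation (a hypothetical sketch, not the paper's OpenMP extension) shows the commit-time dependence check that replaces a compile-time guarantee:

```cpp
// Each speculative epoch records its read and write sets; epochs commit in
// program order, and an epoch that read a location written by an earlier
// epoch is conservatively squashed and re-executed sequentially.
#include <cstddef>
#include <cstdio>
#include <set>
#include <vector>

struct Epoch {
    std::set<int> reads, writes; // abstract "addresses" touched speculatively
};

int main() {
    std::vector<Epoch> epochs;
    epochs.push_back({{7}, {42}});  // epoch 0: reads 7, writes 42
    epochs.push_back({{42}, {43}}); // epoch 1: reads 42 -> true dependence

    std::set<int> written_so_far;
    for (std::size_t i = 0; i < epochs.size(); ++i) {
        bool conflict = false;
        for (int r : epochs[i].reads)
            if (written_so_far.count(r)) conflict = true;
        if (conflict)
            std::printf("epoch %zu: squash, re-execute sequentially\n", i);
        else
            std::printf("epoch %zu: commit\n", i);
        written_so_far.insert(epochs[i].writes.begin(), epochs[i].writes.end());
    }
}
```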
Efficient Runtime Thread Management for the Nano-Threads Programming Model
The nano-threads programming model was proposed to effectively integrate multiprogramming on shared-memory multiprocessors, with the exploitation of fine-grain parallelism from standard applications. A prerequisite for the applicability of the nano-threads programming model is the ability of the runtime environment to manage parallelism at any level of granularity with minimal overheads. In thi...
Procedure Cloning and Integration for Converting Parallelism from Coarse to Fine Grain
This paper introduces a method for improving program run-time performance by gathering work in an application and executing it efficiently in an integrated thread. Our methods extend whole-program optimization by expanding the scope of the compiler through a combination of software thread integration and procedure cloning. In each experiment we integrate a frequently executed procedure with its...
Improving performance of optimized kernels through fast instantiations of templates
To fully exploit the instruction-level parallelism offered by modern processors, compilers need the necessary information available during the execution of the program. This advocates for iterative or dynamic compilation. Unfortunately, dynamic compilation is suitable only for applications where the cost of compilation may be amortized by multiple invocations of the same code. Similarly, the co...
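The snippet above argues that dynamic compilation pays off only when its cost is amortized over repeated invocations. A minimal sketch of that amortization pattern, where Kernel and instantiate are hypothetical names unrelated to the paper's actual system:

```cpp
// Specialized kernels are built once per parameter value and cached,
// so later calls pay only a lookup instead of recompilation.
#include <cstdio>
#include <functional>
#include <map>

using Kernel = std::function<int(int)>;

// Stands in for an expensive runtime compilation / template instantiation.
Kernel instantiate(int factor) {
    std::printf("instantiating kernel for factor %d\n", factor);
    return [factor](int x) { return x * factor; };
}

int main() {
    std::map<int, Kernel> cache;
    for (int call = 0; call < 6; ++call) {
        int factor = call % 2; // only two distinct specializations needed
        auto it = cache.find(factor);
        if (it == cache.end()) // pay the instantiation cost exactly once
            it = cache.emplace(factor, instantiate(factor)).first;
        std::printf("call %d -> %d\n", call, it->second(call));
    }
}
```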
Thread Integration for Error Detection and Performance
This paper presents a technique for integrating multiple threads of computation for simultaneous execution, which is well suited to fault-tolerant application programs. A post-pass compiler has been developed that is capable of taking the application program as the host thread and automatically integrating the additional code as a guest thread to produce a composite thread, the execution of...
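The composite host/guest thread idea can be illustrated with a toy sketch. This is a hypothetical example of inline duplication for error detection, not the paper's post-pass compiler output; host_step is an invented stand-in for real application work:

```cpp
// The guest's redundant computation is statically interleaved with the
// host's and checked inline, so no second runtime thread is needed.
#include <cstdio>

int host_step(int x) { return x * 2 + 1; }

int main() {
    int host = 0, guest = 0;
    for (int i = 0; i < 5; ++i) {
        host = host_step(host);   // host thread's original work
        guest = host_step(guest); // guest thread: interleaved duplicate
        if (host != guest)        // inline check; fires only on a fault
            std::printf("fault detected at step %d\n", i);
    }
    std::printf("result = %d\n", host);
}
```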
Publication date: 1995